Prediction of postpartum prediabetes by machine learning methods in women with gestational diabetes mellitus

Summary Early onset of type 2 diabetes and cardiovascular disease are common complications for women diagnosed with gestational diabetes. Prediabetes refers to a condition in which blood glucose levels are higher than normal, but not yet high enough to be diagnosed as type 2 diabetes. Currently, there is no accurate way of knowing which women with gestational diabetes are likely to develop postpartum prediabetes. This study aims to predict the risk of postpartum prediabetes in women diagnosed with gestational diabetes. Our sparse logistic regression approach selects only two variables – antenatal fasting glucose at OGTT and HbA1c soon after the diagnosis of GDM – as relevant, but gives an area under the receiver operating characteristic curve of 0.72, outperforming all other methods. We envision this to be a practical solution, which coupled with a targeted follow-up of high-risk women, could yield better cardiometabolic outcomes in women with a history of GDM.


INTRODUCTION
Gestational diabetes mellitus (GDM) is defined as any degree of glucose intolerance with onset or first recognition during pregnancy. 4-6 GDM is associated with an increased risk of cardiovascular dysfunction, including a rise in cardiovascular risk factors such as blood pressure, and adverse changes in cholesterol and triglycerides. 7 However, this risk is not the same for all women diagnosed with GDM.
There is some evidence that glucose levels during pregnancy are predictive of prediabetes. 8,9 Retnakaran et al. 10 have shown that the risk of dysglycemia at 12 weeks postpartum increases across the groups from normal glucose challenge test (GCT) and normal glucose tolerance (NGT), to abnormal GCT and NGT, to gestational impaired glucose tolerance (GIGT), to GDM. This has been supported by other studies. 11,12 Higher fasting glucose shows a high tendency of conversion to type 2 diabetes mellitus (T2DM) in the postpartum period, 7,13 and antenatal fasting glucose > 5.7 mmol/L is considered to be an important antenatal variable for the prediction of postpartum abnormal glucose metabolism. 14 Along with glucose values in pregnancy, many studies have proposed the significance of gestational age at the time of diagnosis of GDM in predicting postpartum prediabetes. 15,16 Specifically, women diagnosed at 24 weeks of gestation or earlier are at higher risk of having postpartum prediabetes. 17 Similarly, the requirement of insulin therapy during pregnancy, ethnicity, gravidity, BMI, weight at the time of delivery, and neonatal weight are other factors that have been shown to be associated with the risk of prediabetes. 18 While there is ample evidence of multiple factors being associated with T2DM onset in GDM-diagnosed women in general, there is no personalized risk score that can predict whether a specific GDM-diagnosed woman is likely to develop prediabetes or T2DM. Identifying women who are at especially high risk can help in implementing targeted, personalized interventions to delay and prevent the onset of T2DM and its future complications.
Artificial intelligence has begun to play a dominant role in healthcare, facilitating optimal decision-making as well as personalized treatment. Although Kumar et al. 19 and Muche et al. 20 have shown evidence of using machine learning for predicting progression of GDM to postpartum type 2 diabetes, its use in the development of predictive models for T2DM onset is still in its nascent stages. Accurate prediabetes risk stratification at or before delivery for GDM women could assist policymakers and clinicians in specifically targeting those at the highest risk, especially in resource-constrained settings.
Postpartum screening is poor in many parts of the world, as women have many competing demands on their time during this period. 21,22 We and others have shown that women who miss postpartum screening have higher cardiometabolic risk factors. 23,24 While dedicated healthcare administrators can improve the screening, uptake is still suboptimal. Therefore, a personalized strategy that identifies who is at risk of developing postpartum prediabetes/diabetes could help healthcare professionals target education on the importance of screening for prediabetes/diabetes following a GDM pregnancy.
The primary aim of this paper is to investigate the predictive ability of the antenatal variables and derive a model for personalized prediction of prediabetes. We explored the use of logistic regression (LR) and tree-based machine learning algorithms for developing the prognostic model. We report our findings on a multi-ethnic retrospective cohort in the UK.

Data acquisition
A retrospective audit of electronic database records of postpartum screening at 6 to 13 weeks of women diagnosed with GDM, from January 2016 to December 2019, was conducted at an NHS trust hospital in the UK. GDM was diagnosed using NICE 2015 criteria. 25 Complete data are available for 607 women for the following variables: age, height, weight, BMI, systolic and diastolic BP at booking, ethnicity, gravida, parity, smoking status, marital status, employment status, gestational age at delivery, mode of delivery, birth weight, breastfeeding status, and biochemical variables such as antenatal fasting glucose (A-FG), antenatal postprandial glucose (A-PG), antenatal HbA1c (A-HbA1c), postpartum fasting glucose (P-FG), postpartum postprandial glucose (P-PG), and postpartum HbA1c (P-HbA1c). Postpartum oral glucose tolerance test (OGTT) was carried out at 6 weeks, and following the change in the NICE guidelines, postpartum HbA1c was carried out at 12-13 weeks following delivery. We define prediabetes as: P-FG ≥ 5.6 mmol/L or P-PG ≥ 7.8 mmol/L or P-HbA1c ≥ 40 mmol/mol. Postpartum impaired fasting glucose (ppIFG) was defined as P-FG ≥ 5.6 mmol/L and postpartum impaired glucose tolerance (ppIGT) as P-PG ≥ 7.8 mmol/L. We define T2DM as: P-FG ≥ 7.0 mmol/L or P-PG ≥ 11.1 mmol/L or P-HbA1c ≥ 48 mmol/mol. 26 NGT is considered otherwise. We provide the definitions of Normalcy, Prediabetes, and Incident diabetes based on the different measures in Table 1.
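The outcome definitions above can be written as a small labeling function. This is only an illustrative sketch: the function name `classify_postpartum` and the handling of unmeasured values (`None`) are assumptions made here, as is the choice to check the T2DM thresholds before the prediabetes thresholds.

```python
from typing import Optional


def classify_postpartum(p_fg: Optional[float] = None,
                        p_pg: Optional[float] = None,
                        p_hba1c: Optional[float] = None) -> str:
    """Label a postpartum result as 'T2DM', 'prediabetes' or 'NGT'.

    Glucose values are in mmol/L and HbA1c is in mmol/mol.
    A value of None means the test was not performed (an assumption of this sketch).
    """
    def at_least(value, cutoff):
        return value is not None and value >= cutoff

    # T2DM thresholds are checked first so they take precedence (assumed ordering).
    if at_least(p_fg, 7.0) or at_least(p_pg, 11.1) or at_least(p_hba1c, 48):
        return "T2DM"
    if at_least(p_fg, 5.6) or at_least(p_pg, 7.8) or at_least(p_hba1c, 40):
        return "prediabetes"
    return "NGT"


# Hypothetical postpartum OGTT result: fasting 5.8 mmol/L, 2-h 6.0 mmol/L
print(classify_postpartum(p_fg=5.8, p_pg=6.0))
```

Any single criterion is enough to assign a category, which mirrors the "or" in the definitions.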

Statistical power analysis
We did a power analysis to determine if the available sample size was sufficient to identify the difference in effect between the normal and prediabetes-diagnosed GDM women. We used the TTestIndPower class from the statsmodels library in Python to carry out the power analysis for Student's t test for independent samples. For a statistical power of 90%, a minimum sample size of 130 (99 normal and 31 prediabetes) is required for the observed effect size calculated using Cohen's d statistic. We provide the details of power analysis in the supplementary material.
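As a rough cross-check of this calculation, the sketch below estimates the group sizes with a normal approximation using only the Python standard library. It is not the statsmodels computation used in the study (which is based on the noncentral t distribution), so its result should land close to, and slightly below, the reported figures. The function name `min_sample_sizes` is hypothetical; the inputs (effect size 0.681, significance level 0.05, power 0.9, group ratio 92/302) are the values reported for antenatal fasting glucose in the supplementary analysis.

```python
from math import ceil
from statistics import NormalDist


def min_sample_sizes(effect_size, alpha=0.05, power=0.9, ratio=1.0):
    """Approximate group sizes for a two-sided, two-sample comparison of means.

    ratio is n2 / n1. Normal approximation, not the noncentral t distribution.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value of the two-sided test
    z_beta = z.inv_cdf(power)            # quantile matching the target power
    n1 = ceil((1 + 1 / ratio) * ((z_alpha + z_beta) / effect_size) ** 2)
    n2 = ceil(n1 * ratio)
    return n1, n2


n_normal, n_prediabetes = min_sample_sizes(0.681, alpha=0.05, power=0.9, ratio=92 / 302)
print(n_normal, n_prediabetes)
```

With statsmodels installed, the corresponding call would be `TTestIndPower().solve_power(effect_size=0.681, alpha=0.05, power=0.9, ratio=92/302)`, which returns the size of the first group.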

Machine learning
We perform machine learning (ML) in Python version 3.7. We compare LR with tree-based methods to build the prognostic model for the prediction of early prediabetes in GDM women. These algorithms address the imbalance between the two classes of the prediabetes outcome through the 'balanced' class-weight setting. The 'balanced' mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data, as the ratio of the total number of samples to the product of the number of classes and the number of occurrences in each class. Mathematically, the class weight is calculated as 1/(2 × fraction of women in the class). We build the tree-based model using a simple decision tree algorithm, whose performance improves using ensemble methods such as bagging and boosting. All these algorithms use hyperparameters that can significantly affect their performance on an unseen set. We determine the optimal values of these hyperparameters using nested cross-validation. More specifically, we run leave-one-out cross-validation (CV1) on the entire data for model evaluation, and we perform an internal stratified 4-fold cross-validation (CV2) on the training folds of CV1 for hyperparameter optimization. We impute the missing values with the Multivariate Imputation by Chained Equations (MICE) technique, using the other non-missing covariates. We scale the training data in CV1 using the StandardScaler function and use the saga solver in the LR model. The saga solver is a variant of the stochastic average gradient (sag) solver that also supports the non-smooth L1 penalty, which promotes feature selection. The tree-based algorithms perform feature selection inherently, governed by the optimized hyperparameters in CV2. We perform hyperparameter optimization and model training only on the training folds (n − 1 samples) in CV1, with an independent set (1 sample) exclusively held out for testing. We aggregate the model predictions on each held-out sample across the n training folds of CV1 and plot the Receiver Operating Characteristic (ROC) curve for this aggregated set. We use the area under the ROC curve as a measure of performance. Finally, we apply the same procedure to the full data to obtain the final model for deployment (Figure S3). We provide the details of the different tree-based methods employed in the supplementary materials.
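The nesting of CV2 inside CV1 can be illustrated by looking only at how the sample indices are split. The sketch below is a simplified, dependency-free illustration and not the study's code: the helper names `stratified_folds` and `nested_splits` are hypothetical, the folds are dealt out deterministically without shuffling, and no model is fitted. In a full pipeline, imputation, scaling, and hyperparameter search would be fitted on the `fit` indices of each inner split, and the held-out sample would only be used for the final prediction.

```python
from collections import defaultdict


def stratified_folds(indices, labels, k=4):
    """Split indices into k folds, dealing each class out round-robin
    so that class proportions are roughly preserved in every fold."""
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    position = 0
    for members in by_class.values():
        for i in members:
            folds[position % k].append(i)
            position += 1
    return folds


def nested_splits(labels, k=4):
    """Yield (held_out, inner_splits) for leave-one-out CV1 with an inner
    stratified k-fold CV2 built only from the CV1 training indices."""
    n = len(labels)
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        folds = stratified_folds(train, labels, k)
        inner = []
        for j in range(k):
            val = folds[j]
            fit = [i for fold in folds[:j] + folds[j + 1:] for i in fold]
            inner.append((fit, val))
        yield held_out, inner


# Toy labels: 9 controls and 3 cases (hypothetical data)
toy_labels = [0] * 9 + [1] * 3
held_out, inner = next(nested_splits(toy_labels, k=4))
print(held_out, [len(val) for _, val in inner])
```

The key property is that the held-out index never appears in any inner split, so the test sample cannot influence hyperparameter selection.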

Composite risk score calculation
Using the coefficients from the final fitted LR model on the full data, we develop a composite risk scoring system using the best selected antenatal variables to predict the probability of prediabetes in GDM-diagnosed women. We calculate the composite risk score as the probability of class 1 obtained from the LR model. It is given by the expression $1/(1 + e^{-\beta})$, where $\beta = \beta_0 + \sum_m \beta_m x_m$, $\beta_0$ is the intercept, and $\beta_m$ is the coefficient of the $m$th variable ($x_m$).
We compute specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, and the F1 score at five predetermined values of sensitivity (60%, 70%, 75%, 80%, and 90%) for the optimal selected model. We give the definitions and formulae for all of these in the supplementary information section.
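These measures all follow from the four cells of the confusion matrix at a given threshold. The sketch below uses the standard textbook definitions; the function name `diagnostic_metrics` and the example counts are hypothetical and are not taken from the study.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic measures from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted and one
    actual case in each class).
    """
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                    # precision
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts at one probability threshold
for name, value in diagnostic_metrics(tp=30, fp=20, tn=80, fn=10).items():
    print(f"{name}: {value:.3f}")
```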

Kullback-Leibler (K-L) divergence and information graphs to evaluate and compare diagnostic tests and select optimal cutpoint
We use the information theory approach in Lee et al., 27 Samawi et al., 28 and Benish et al., 29 briefly summarized below, to select the optimal probability threshold for accurate prediction of the binary outcome of prediabetes. An important approach followed in medical diagnostics is to assess the 'rule-in and rule-out' potential of the diagnostic test, that is, its ability to safely include the patients in need of treatment and to exclude those not in need, respectively. At a probability threshold $c$ reported by the ML algorithm, suppose the proportion of the diseased population correctly predicted as diseased is given by $g_1(c)$ and that of the non-diseased population correctly predicted as non-diseased is given by $g_2(c)$. Both $g_1(c)$ and $g_2(c)$ define Bernoulli probability distributions and are simply the sensitivity and specificity, respectively, at the threshold value $c$. The K-L divergence (or relative entropy) measures the separation between these two probability distributions and is given by:

$$D(g_1 \| g_2) = g_1(c) \ln\frac{g_1(c)}{1 - g_2(c)} + (1 - g_1(c)) \ln\frac{1 - g_1(c)}{g_2(c)} \quad \text{(Equation 1)}$$

$$D(g_2 \| g_1) = g_2(c) \ln\frac{g_2(c)}{1 - g_1(c)} + (1 - g_2(c)) \ln\frac{1 - g_2(c)}{g_1(c)} \quad \text{(Equation 2)}$$

By definition, $D(g_1 \| g_2) \ge 0$ and $D(g_2 \| g_1) \ge 0$.
The K-L divergence is close to 0 when there is little difference between the two distributions. A high $D(g_1 \| g_2)$ value indicates an increase in the information for predicting disease onset. We calculate $D(g_1 \| g_2)$ and $D(g_2 \| g_1)$ for 1000 cut-points at an interval of 0.001 from 0 to 1. We choose $T_{in}$, with cut-point $c_{in}$ corresponding to $D_{max}(g_1 \| g_2)$, as the diagnostic test with the greatest rule-in potential. We choose $T_{out}$, with cut-point $c_{out}$ corresponding to $D_{max}(g_2 \| g_1)$, as the diagnostic test with the greatest rule-out potential. We calculate $P_{in} = e^{D(g_1 \| g_2)}$, which is the ratio of the post-test odds to the pre-test odds of having the disease for a randomly selected diseased individual. We also calculate $P_{out} = e^{D(g_2 \| g_1)}$, which is the ratio of the pre-test odds to the post-test odds of having the disease for a randomly selected non-diseased individual. By construction, $P_{in}, P_{out} \ge 1$.
Next, we calculate the Information Distinguishability measures, $ID(g_1 \| g_2) = 1 - e^{-D(g_1 \| g_2)}$ and $ID(g_2 \| g_1) = 1 - e^{-D(g_2 \| g_1)}$, to study and compare the separation provided by the diagnostic test between the diseased and the non-diseased distributions. We calculate the objective function $TKL_{discrete}(c) = D(g_1 \| g_2) + D(g_2 \| g_1)$ and choose the optimal cut-point $c_{in-out}$ corresponding to $\max(TKL_{discrete}(c))$ to achieve maximum information for $T_{in-out}$, with high potential in both rule-in and rule-out situations. Further, we plot information graphs to characterize and compare the performance of our diagnostic tests at different cut-points depending upon the rule-in or rule-out potential. The expected value of the relative entropy provides a measure of the expected diagnostic information, and plotting it as a function of the pre-test probability yields an information graph. Let $D_i$ be the true status and $T_i$ be the diagnostic test result for the patient ($i \in \{0, 1\}$; 0: disease absent, 1: disease present). If $x = \Pr(D_1)$, the diagnostic information obtained from a positive and a negative test result ($I_+(x)$ and $I_-(x)$, Equations 3 and 4) and the expected diagnostic information ($I_E(x)$) are defined following Benish et al. 29 The expected diagnostic information is:

$$I_E(x) = x\, g_1(c) \ln g_1(c) + (1 - x)(1 - g_2(c)) \ln(1 - g_2(c)) + x\,(1 - g_1(c)) \ln(1 - g_1(c)) + (1 - x)\, g_2(c) \ln g_2(c) - \Pr(T_1) \ln \Pr(T_1) - (1 - \Pr(T_1)) \ln(1 - \Pr(T_1)) \quad \text{(Equation 5)}$$

where

$$\Pr(T_1) = x\, g_1(c) + (1 - x)(1 - g_2(c)) \quad \text{(Equation 7)}$$

In addition, we also plot the information graph by representing the total K-L divergence as the discrete Bregman divergence, which is the sum of the vertical distances between the negative Shannon entropy function (see supplementary material for details) and tangents to it at probabilities $p = g_1(c)$ and $p = 1 - g_2(c)$.
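The cut-point selection described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the study's code: the names `kl_bernoulli`, `kl_summary`, and `best_cutpoints` are hypothetical, the rule-in and rule-out divergences are taken to be the relative entropies between the Bernoulli distributions of a positive result in diseased and non-diseased women, and cut-points where a divergence is infinite are skipped. The sensitivity and specificity pairs in the usage example are illustrative values chosen here, not entries from Table 4.

```python
from math import exp, isfinite, log


def kl_bernoulli(p, q):
    """K-L divergence D(p || q) between Bernoulli(p) and Bernoulli(q), in nats."""
    total = 0.0
    for a, b in ((p, q), (1 - p, 1 - q)):
        if a > 0:
            if b <= 0:
                return float("inf")
            total += a * log(a / b)
    return total


def kl_summary(sens, spec):
    """Rule-in and rule-out measures at one cut-point (g1 = sens, g2 = spec)."""
    d_in = kl_bernoulli(sens, 1 - spec)    # D(g1 || g2)
    d_out = kl_bernoulli(spec, 1 - sens)   # D(g2 || g1)
    return {
        "D_in": d_in, "D_out": d_out,
        "P_in": exp(d_in), "P_out": exp(d_out),
        "ID_in": 1 - exp(-d_in), "ID_out": 1 - exp(-d_out),
        "T_KL": d_in + d_out,
    }


def best_cutpoints(curve):
    """curve: iterable of (cut_point, sensitivity, specificity) triples."""
    rows = [(c, kl_summary(se, sp)) for c, se, sp in curve]
    rows = [(c, s) for c, s in rows if isfinite(s["T_KL"])]  # skip degenerate cut-points

    def pick(key):
        return max(rows, key=lambda row: row[1][key])[0]

    return {"c_in": pick("D_in"), "c_out": pick("D_out"), "c_in_out": pick("T_KL")}


# Illustrative operating points (assumed values)
summary = kl_summary(sens=0.355, spec=0.92)
print({k: round(v, 2) for k, v in summary.items()})
print(best_cutpoints([(0.140, 0.92, 0.345), (0.381, 0.355, 0.92)]))
```

In practice the `curve` argument would hold the 1000 cut-points from 0 to 1 with the sensitivity and specificity computed from the aggregated held-out predictions.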

Decision curve analysis
We carry out decision curve analysis (DCA) to evaluate the performance of our model against the 'treat all' and 'treat none' approaches. Finally, we compare the correctly identified non-attenders (sensitivity) with the follow-ups avoided (the true negatives + false negatives, obtained from the optimal selected model) to calculate the number of women requiring enhanced care, so as to maximize targeted postpartum follow-up.
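The net-benefit formula given in the Figure 2 legend can be evaluated directly from confusion-matrix counts. The sketch below is a minimal illustration; the function names are hypothetical, and the model counts in the usage example (60 true positives, 100 false positives) are made-up values, while 92 cases among 394 women matches the class counts reported in the supplementary analysis.

```python
def net_benefit(tp, fp, n, p_t):
    """Net benefit = (TP - FP * p_t / (1 - p_t)) / N at threshold probability p_t."""
    return (tp - fp * p_t / (1 - p_t)) / n


def treat_all_net_benefit(prevalence, p_t):
    """Net benefit when every woman is followed up (all cases are TP, all controls FP)."""
    return prevalence - (1 - prevalence) * p_t / (1 - p_t)


# 'Treat none' has a net benefit of 0 by definition.
p_t = 0.26
model_nb = net_benefit(tp=60, fp=100, n=394, p_t=p_t)       # hypothetical model counts
all_nb = treat_all_net_benefit(prevalence=92 / 394, p_t=p_t)
print(round(model_nb, 4), round(all_nb, 4))
```

Dividing a net benefit by the prevalence gives the standardized net benefit, so the ranking of strategies at a given threshold is unchanged.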

Machine learning analysis
The data are imbalanced (as expected), with a 23.35% representation of the positive prediabetes class. We compare simple LR with different classification tree methods for predicting prediabetes from training on this small and imbalanced dataset. We use class-weight = balanced in the LR algorithm and 'balanced' classification tree-based algorithms from the imbalanced-learn Python package for developing the tree-based prognostic models. The predictive performance of our proposed framework improves significantly by applying ensemble methods of bagging and boosting to the base decision tree estimator but remains lower than that of LR. LR gives an area under the ROC curve of 0.7203 from aggregating the test predictions from the leave-one-out cross-validation (Figure 2A). The Brier score loss for calibration of the LR model is 0.1530 and the calibration plot is shown in Figure S9. The mean CV-accuracy as a function of the regularization constant 'C' is shown in Figure S5. LR gives an area under the ROC curve of 0.6598 for postpartum fasting glucose prediction (Figure S6). Using the base decision tree algorithm and leave-one-out cross-validation, the area under the ROC curve for the aggregated test predictions is 0.6210; bagging decision trees improves it to 0.6883. Random forests further improve it to 0.6944 using 4-fold stratified cross-validation in CV1, and the maximum area under the ROC curve from the tree-based algorithms is 0.6991, obtained from balanced bagging with the histogram-based gradient boosting tree classifier, also using 4-fold stratified cross-validation (Figure S4). We use 4-fold stratified cross-validation in CV1 instead of leave-one-out for random forests and the boosting algorithm due to the high time complexity of leave-one-out. Other boosting algorithms, namely XGBoost, LightGBM, and CatBoost, give an area under the ROC curve of 0.6427, 0.6646, and 0.6948, respectively. We conclude that the simplest prediction algorithm for binary classification, LR, outperforms the advanced tree-based methods in the prediction of prediabetes. Our final composite risk score using the LR model with A-FG and A-HbA1c is highly robust for the prediction of prediabetes in GDM women. Out of the n = 394 runs of leave-one-out cross-validation, antenatal fasting glucose and antenatal HbA1c are selected 318 (> 80%) times. The SHAP summary plots generated using the SHAP TreeExplainer in Python provide additional evidence supporting the finding that A-FG and A-HbA1c are the sole significant predictors of postpartum prediabetes in women with GDM (Figures S7 and S8).
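To make the effect of the 'balanced' setting concrete, the sketch below applies the class-weight rule (total samples divided by the product of the number of classes and the class count) to the class sizes of this cohort, 302 women without and 92 women with prediabetes. The function name `balanced_class_weights` is a hypothetical helper written for illustration; it reproduces the rule described in the methods rather than calling any library.

```python
def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


labels = [0] * 302 + [1] * 92          # class sizes in the cohort (n = 394)
weights = balanced_class_weights(labels)
print({c: round(w, 3) for c, w in weights.items()})
print(round(92 / 394 * 100, 2))        # share of the positive class, in percent
```

The minority class receives a weight of roughly 2.14 against roughly 0.65 for the majority class, so each class contributes equally to the loss in aggregate.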

Composite risk score calculation
Based on our proposed final LR model, we calculate the composite risk score, c (or P(prediabetes)), as

$$P(\text{prediabetes}) = \frac{1}{1 + e^{-(-8.36 + 0.58 \times \text{A-FG} + 0.10 \times \text{A-HbA1c})}} \quad \text{(Equation 8)}$$

The association results of the LR model between the risk predictors and the prediabetes outcome are given in Table 3.
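Equation 8 translates into a one-line function. The sketch below assumes that the coefficients apply to unscaled inputs, with A-FG in mmol/L and A-HbA1c in mmol/mol; the text does not state this explicitly, so treat it as an assumption. The function name `prediabetes_risk` and the example values are hypothetical.

```python
from math import exp

# Coefficients from Equation 8
INTERCEPT = -8.36
COEF_A_FG = 0.58       # per mmol/L of antenatal fasting glucose (assumed unit)
COEF_A_HBA1C = 0.10    # per mmol/mol of antenatal HbA1c (assumed unit)


def prediabetes_risk(a_fg, a_hba1c):
    """Composite risk score: probability of postpartum prediabetes from Equation 8."""
    linear_predictor = INTERCEPT + COEF_A_FG * a_fg + COEF_A_HBA1C * a_hba1c
    return 1 / (1 + exp(-linear_predictor))


# Hypothetical woman: A-FG 5.5 mmol/L, A-HbA1c 38 mmol/mol
print(round(prediabetes_risk(5.5, 38), 3))
```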

Kullback-Leibler (K-L) divergence and information graphs to evaluate and compare diagnostic tests and select optimal cutpoint
$T_{in}$, with $D_{max}(g_1 \| g_2) = 0.30$ and $c_{in} = 0.381$, has a high specificity of 92%, in concurrence with the 'rule-in-specific-test' principle, and $T_{out}$, with $D_{max}(g_2 \| g_1) = 0.28$ and $c_{out} = 0.140$, has a high sensitivity of 92%, again in concurrence with the 'rule-out-sensitive-test' principle. $P_{in} = 1.35$ and $P_{out} = 1.23$ for $T_{in}$, and $P_{in} = 1.21$ and $P_{out} = 1.33$ for $T_{out}$, which is the increase (decrease) in disease odds after the test for a diseased (control) individual. $T_{in-out}$, with $\max(TKL_{discrete}(c)) = 0.51$ for $c_{in-out} = 0.260$, has $P_{in} = 1.31$ and $P_{out} = 1.27$. Also, the maximum of Youden's index, $J_{max} = 0.34$ (where $J(c) = g_1(c) + g_2(c) - 1$), and the maximum F1 score of 0.49 occur at the same $c_{in-out} = 0.260$. $e^{T_{in}(KL_{in}) - T_{out}(KL_{in})} = e^{0.30 - 0.19} = 1.12 > 1$, which implies that a positive result obtained by $T_{in}$ is more likely to be true than a positive result obtained by $T_{out}$. In other words, $T_{in}$ is more specific and yields fewer false positives compared to $T_{out}$. Similarly, $e^{T_{in}(KL_{out}) - T_{out}(KL_{out})} = e^{0.21 - 0.28} = 0.93 < 1$ shows that $T_{in}$ is less sensitive, with more false negatives. We generated the information graphs using the equations for $I_+(x)$, $I_-(x)$, and $I_E(x)$ as a function of $x = \Pr(D_1)$, as shown in Figures 3A-3C. We can observe that $T_{in}$ provides the most diagnostic information when the test result is positive and the pre-test probability of disease ($\Pr(D_1)$) is low. $T_{out}$ provides the most diagnostic information when the test result is negative and the pre-test probability of disease is high. For $T_{in-out}$, we obtain more diagnostic information when the test yields a positive result than a negative one, and we obtain maximum information from a positive result at a lower pre-test probability than that from the negative result. In Figure 3D, we can see the information gained using the discrete Bregman divergence representation of $TKL_{discrete}$ by adding the vertical distances from the negative Shannon entropy function to the tangents drawn at probabilities $p = g_1(c)$ and $p = 1 - g_2(c)$.
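The curves in an information graph can be sketched as follows. This is an interpretation rather than the study's code: it assumes that $I_+(x)$ and $I_-(x)$ are the relative entropies (in nats) between the post-test and pre-test disease distributions after a positive or negative result, which is the usual reading of this approach and is consistent with Equation 5, since weighting the two by the probabilities of each test result recovers the expected information. All function names are hypothetical, and the sensitivity of 0.355 used in the example is an illustrative value chosen here.

```python
from math import log


def _term(a, b):
    """a * ln(a / b), with the convention 0 * ln(0) = 0."""
    return a * log(a / b) if a > 0 else 0.0


def pr_positive(x, sens, spec):
    """Pr(T1) = x * g1 + (1 - x) * (1 - g2), as in Equation 7."""
    return x * sens + (1 - x) * (1 - spec)


def _info_given_result(x, p_result_d1, p_result_d0):
    """Relative entropy between post-test and pre-test distributions (assumed form)."""
    p_result = x * p_result_d1 + (1 - x) * p_result_d0
    post = x * p_result_d1 / p_result        # post-test probability of disease
    return _term(post, x) + _term(1 - post, 1 - x)


def info_positive(x, sens, spec):
    return _info_given_result(x, sens, 1 - spec)


def info_negative(x, sens, spec):
    return _info_given_result(x, 1 - sens, spec)


def info_expected(x, sens, spec):
    """Expected information: the two results weighted by their probabilities."""
    p_pos = pr_positive(x, sens, spec)
    return p_pos * info_positive(x, sens, spec) + (1 - p_pos) * info_negative(x, sens, spec)


def info_expected_eq5(x, sens, spec):
    """Expected information written term by term as in Equation 5."""
    p_pos = pr_positive(x, sens, spec)
    return (x * sens * log(sens) + (1 - x) * (1 - spec) * log(1 - spec)
            + x * (1 - sens) * log(1 - sens) + (1 - x) * spec * log(spec)
            - p_pos * log(p_pos) - (1 - p_pos) * log(1 - p_pos))


# Points on the curves for a rule-in style test (requires 0 < x, sens, spec < 1)
for x in (0.1, 0.2335, 0.5):
    print(x, round(info_positive(x, 0.355, 0.92), 3), round(info_negative(x, 0.355, 0.92), 3))
```

Evaluating these functions over a grid of pre-test probabilities and plotting them gives curves of the kind shown in Figures 3A-3C.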
Using the prognostic model with LR, 15 out of 100 women are above the optimal rule-in threshold of 0.381, and focusing on these women could improve early prediabetes diagnosis. 28 out of 100 women are below the optimal rule-out threshold of 0.140, and testing for early prediabetes diagnosis can be safely avoided in this category. The model shows 92% specificity for the rule-in test and 92% sensitivity for the rule-out test. Table 4 shows the sensitivity, specificity, PPV, NPV, F1 score, accuracy, and other measures related to K-L divergence at different probability thresholds.

Decision curve analysis
In the decision curve analysis, which compares the model with the 'treat all' and 'treat none' approaches, the ML model obtains a higher standardized net benefit than universal screening of all GDM women for early prediabetes (Figure 2B).

DISCUSSION
In this study, we try to predict, at the time of delivery, whether women diagnosed with GDM are at high risk of being diagnosed with prediabetes at 6-13 weeks postpartum. For this purpose, we employ a variety of machine learning techniques, including both LR and advanced tree-based algorithms, and train the models using routinely collected antenatal and delivery variables as predictors. Our proposed model, using nested cross-validation and the LR algorithm, can effectively predict prediabetes in GDM women using only the antenatal predictors fasting glucose and HbA1c, with good sensitivity and specificity. The proposed model has the capability to serve as a valuable tool for prediction and targeted screening for postpartum prediabetes in women with GDM during the antenatal period itself. By identifying individuals at higher risk, healthcare providers can implement timely interventions to target postpartum weight retention, which has been shown to be an independent predictor of future prediabetes/diabetes, through personalized lifestyle modifications. This proactive approach can help to prevent or delay the onset of type 2 diabetes, improve long-term health outcomes, and reduce healthcare costs associated with managing diabetes-related complications. The use of machine learning for predicting postpartum prediabetes in GDM-diagnosed women has rarely been studied. We are aware of only two studies that have made use of machine learning algorithms to predict the occurrence of T2DM post-GDM: Kumar et al. 19 and Krishnan et al. 30 Krishnan et al. proposed random forest and Gaussian naive Bayes algorithms to predict T2DM after GDM, and achieved a modest specificity of 23% at a sensitivity of 88%. That study also lacked the use of advanced techniques to deal with imbalanced data. Real-world medical data are scarce due to the different challenges posed in their collection. To the best of our knowledge, no larger dataset has been collected for studying prediabetes in GDM women than the data in the present study. In our study, we propose a more personalized approach to identifying postpartum prediabetes after GDM, at the antenatal visit itself, by calculating a simple score based on only two easy-to-measure biochemical predictors, obtained using machine learning techniques and an LR algorithm, with good sensitivity and specificity (92% sensitivity for the rule-out test and 92% specificity for the rule-in test). Further, we suggest different cut-offs for classifying high-risk women depending upon resource availability.
The proposed prediction test needs only the antenatal fasting glucose (at the time of antenatal OGTT) and HbA1c, usually measured soon after the diagnosis of GDM for clinical use. Thus, no additional tests or costs are involved, and the test is easy for healthcare professionals to use. The information theory analysis proposes different cut-offs for classification according to the requirement of ruling in or ruling out the prediabetes condition in GDM-diagnosed women. All women diagnosed with GDM during pregnancy are recommended to have annual screening, 25,31 although compliance is currently poor. 5,24 Therefore, we can allow for more false positives than false negatives and propose $c_{out} = 0.140$ as the optimal cut-off for classification. However, in low-resource settings, we can primarily focus on women with P(prediabetes) ≥ $c_{in} = 0.381$ and then consider women with P(prediabetes) ≥ $c_{in-out} = 0.260$ in the following step. If resource constraint is not an issue, we can target women with P(prediabetes) ≥ $c_{out} = 0.140$ as well. Targeting GDM women stepwise according to their risk of developing prediabetes is more personalized than the blanket approach of targeting all women with GDM. This could be a pragmatic approach in settings with limited resources. The desired cut-off out of $c_{in}$, $c_{out}$, or $c_{in-out}$ can be chosen depending upon the purpose and setting in which this diagnostic test is used.
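The stepwise targeting described above amounts to mapping a predicted probability onto priority tiers using the three cut-offs. The sketch below is illustrative only: the function name `follow_up_tier` and the tier numbering are hypothetical conventions introduced here, not part of the study.

```python
C_IN = 0.381       # rule-in cut-off
C_IN_OUT = 0.260   # combined rule-in/rule-out cut-off
C_OUT = 0.140      # rule-out cut-off


def follow_up_tier(p_prediabetes):
    """Return a priority tier for postpartum follow-up.

    1: first priority (p >= c_in), suited to low-resource settings
    2: second step (c_in_out <= p < c_in)
    3: included when resources allow (c_out <= p < c_in_out)
    0: below the rule-out cut-off
    """
    if p_prediabetes >= C_IN:
        return 1
    if p_prediabetes >= C_IN_OUT:
        return 2
    if p_prediabetes >= C_OUT:
        return 3
    return 0


for p in (0.45, 0.30, 0.20, 0.10):
    print(p, follow_up_tier(p))
```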
Postpartum weight loss has been shown to reduce the risk of incidence of T2DM and recurrent GDM in the subsequent pregnancy. 32,33 However, initiating such lifestyle interventions can be difficult due to lack of personalization, and they may not produce optimum results due to poor adherence by the women. 34 Our approach to identifying women with a high risk of prediabetes (using any 'c') can provide an improved understanding of individualized prediabetes risk, which can be used to target women for interventions (diet and lifestyle, encouraging breastfeeding, etc.) for postpartum weight loss. This can in turn improve their T2DM and CVD risk profile. Women are most receptive to interventions during pregnancy and also maintain close contact with healthcare professionals. Identifying the high-risk women during the antenatal visits will help healthcare professionals to implement necessary interventions throughout the remaining pregnancy period, and also encourage postpartum follow-up. These strategies can include personalized monitoring, education and support on lifestyle changes, and early treatment, if necessary, for high-risk women. Inexpensive medications such as metformin have been shown to prevent type 2 diabetes in women with a history of GDM and may provide added benefit in high-risk women. In addition, empowering high-risk women with knowledge about healthy lifestyle choices, self-care practices, and potential risk factors can facilitate informed decision-making and sustained behavior change.
We believe that the results obtained support testing and validating our rule-in and rule-out composite risk score approach on a larger prospective dataset. Real-world validation of machine learning models is also an essential step in ensuring their effectiveness and reliability. Real-world validation of trained ML models requires an understanding of domain shift, continuous monitoring of model performance, data collection for recalibration, and the application of techniques like active learning, transfer learning, and domain adaptation. As and when we get access to more datasets of similar high quality from the field, the model can be updated, ensuring, as in this paper, that there is no contamination of training data with test data during model updating. It would not be advisable to update ML models in real time in the field, because of the need to ensure data quality as well as the absence of contamination when training the model.

Strengths and limitations
The key strength of our study is the use of a variety of machine learning techniques and the comparison of the LR algorithm with tree-based algorithms for developing the prognostic model for individualized risk prediction of prediabetes following GDM pregnancy. In addition, to

QUANTIFICATION AND STATISTICAL ANALYSIS
In statistics, power analysis is used to determine the probability of finding a significant difference between two sample distributions, if it exists. A statistical hypothesis test makes an assumption about the outcome. The null hypothesis in a statistical test is that there is no significant difference between the specified populations; any observed difference is due to sampling or experimental error. The statistical power is the probability of correctly rejecting the null hypothesis. Therefore, in mathematical terms, power can be defined as the probability of a true positive (TP). For a predefined significance level and known effect size, we can either fix the power and calculate the minimum sample size required to detect the effect, or calculate the power for the available sample size. Antenatal fasting glucose (ANF) and antenatal HbA1c (ANHbA1c) are the two selected predictors for antenatal prediction of prediabetes in GDM-diagnosed women. The sample distributions for ANF and ANHbA1c for the prediabetes (class 1) and normal (class 0) groups are shown in Figures S1A and S1B, respectively. Let r be the ratio of the number of samples in the second sample distribution to that in the first. Then r = Nobs2/Nobs1 = 92/302 = 0.305.

Calculating effect size
We use Cohen's d to calculate the effect size. Let $\eta_1$ and $\eta_2$ be the number of samples in distribution 1 (class 0) and distribution 2 (class 1), respectively. Let $\mu_1$ and $\mu_2$ be the means and $\sigma_1$ and $\sigma_2$ the standard deviations of the two sample distributions. Then Cohen's d statistic is given by (Cohen, Jacob. Statistical power analysis for the behavioral sciences. Academic Press, 2013):

$$d = \frac{\mu_1 - \mu_2}{\sqrt{\dfrac{(\eta_1 - 1)\,\sigma_1^2 + (\eta_2 - 1)\,\sigma_2^2}{\eta_1 + \eta_2 - 2}}} \quad \text{(Equation 10)}$$

Assuming the sample distributions of ANF and ANHbA1c for class 0 and class 1 are normal, we get $d_{ANF} = 0.681$ and $d_{HbA1c} = 0.781$.

Calculating sample size for fixed power
Let us fix the significance level at 0.05 and the statistical power at p = 0.9. Using the Cohen's d calculated above, we get the minimum required sample size as 130 (99 class 0 + 31 class 1) for ANF and 99 (76 class 0 + 23 class 1) for ANHbA1c. Lastly, we plotted power curves to see how the power of the test changes with the other parameters: sample size, effect size, and significance level. In Figure S2, we can see how the power of the test increases with increasing sample size, for different fixed effect sizes. If the effect size is small (greater overlap between the two sample distributions), then a greater number of observations is required to identify the existing significant difference between the two sample distributions, and thus correctly reject the null hypothesis. Also, the power of the test increases with increasing effect size.

Basic formulae
F1 score: 2 × Precision × Recall/(Precision + Recall)
Negative Shannon entropy function: $h(p) = p \ln(p) + (1 - p) \ln(1 - p)$.
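Cohen's d with a pooled standard deviation, as in Equation 10, can be computed from two samples with the standard library alone. The sketch below is illustrative: the function name `cohens_d` and the toy values are hypothetical, and the sign of the result depends on which group is passed first.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(sample_1, sample_2):
    """Cohen's d using the pooled sample standard deviation (Equation 10)."""
    n1, n2 = len(sample_1), len(sample_2)
    s1, s2 = stdev(sample_1), stdev(sample_2)   # sample SDs (n - 1 denominator)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(sample_1) - mean(sample_2)) / pooled


# Toy fasting glucose values in mmol/L (hypothetical, not study data)
class_0 = [4.6, 4.9, 5.0, 5.2, 5.3]
class_1 = [5.1, 5.4, 5.6, 5.9]
print(round(abs(cohens_d(class_0, class_1)), 3))
```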

Figure 1. Consort diagram of early postpartum glucose tolerance. The flow chart displays the proportion of GDM women with and without prediabetes. The diagnosis of prediabetes was made if: FPG ≥ 5.6 mmol/L or 2-h glucose ≥ 7.8 mmol/L at postpartum OGTT, or HbA1c ≥ 40 mmol/mol.

Figure 2. Estimated ROC for the prediction of postpartum prediabetes following a GDM pregnancy. (A) AUROC (area under the receiver operating characteristic curve) was used to evaluate the performance of our machine learning-based method using the logistic regression model on the validation cohort, n = 394, by aggregating the predictions from the test folds of CV1. The area under the ROC curve was 0.7203. The green dots on the ROC curve represent $T_{in}$ ($c_{in} = 0.381$), $T_{in-out}$ ($c_{in-out} = 0.260$), and $T_{out}$ ($c_{out} = 0.140$), from left to right, respectively. (B) The decision curve analysis (DCA) shows the net benefit obtained from the ML (blue) prediction model. The net benefit of implementing our model in a clinical setting is larger than that of following up all GDM women for prediabetes. DCA was derived from the equation $\text{Net benefit} = \dfrac{TP - FP \times \frac{p_t}{1 - p_t}}{N}$, where TP and FP are the true positives and false positives, respectively, $p_t$ is the probability threshold, and $N$ is the total number of participants in the validation cohort, n = 607.

Figure 3. Information graphs for comparing rule-in and rule-out test potentials for predicting a low and high risk of prediabetes post-GDM. Information graphs provide a means to distinguish between diagnostic test performance. We compared the diagnostic information obtained from $T_{out}$, $T_{in-out}$, and $T_{in}$, defined by the cut-points 0.140, 0.260, and 0.381, respectively. A positive diagnosis made by the 'rule-in-specific-test' and a negative diagnosis made by the 'rule-out-sensitive-test' give us the most information, as expected. (A-C) Maximum information from a positive test diagnosis (blue) is obtained at a lower pre-test probability than the maximum information from a negative test diagnosis (red). The diagnostic test with a lower cut-point gives maximum information when the diagnosis is negative (i.e., the test is very sensitive and we can rule out the negative cases safely) and the diagnostic test with a higher cut-point gives maximum information when the diagnosis is positive (i.e., the test is very specific to the disease and we can rule in the positive cases safely). $I_E$ is the expected information from the diagnostic test ($x \times I_+ + (1 - x) \times I_-$, where $x$ is the probability of a positive test diagnosis). (D) The sum of the distances between the tangents to the negative Shannon entropy function at $p = g_1(c)$ and $p = 1 - g_2(c)$ is the discrete Bregman divergence, which represents the total K-L divergence.

Table 1. Definitions of Normalcy, Prediabetes, and Incident diabetes based on the different measures

Table 2. Comparison of antenatal, delivery and postnatal characteristics of GDM women with presence and absence of prediabetes

Table 3. Factors associated with postpartum prediabetes by machine learning model

Table 4. Performance of the diagnostic test for postpartum prediabetes at various probability thresholds